I/O virtual memory (IOMMU) support #327
Conversation
I know why the code coverage CI check fails: I (purposefully) don’t have unit tests for the new code. Why the other tests failed, I don’t know; but I suspect it’s because I force-pushed an update (fixing the CHANGELOG.md link) while the tests were running, maybe SIGTERM-ing them. Looking at the timelines, they all failed (finished) between 16:27:02.985 and 16:27:02.995. (Except the coverage one, which is an actual failure.)

Pushed an update without actual changes (just re-committing the top commit) to trigger a CI re-run. This time, only the coverage check failed (as expected).
Force-pushed from 5ee020b to 4104d5f
Added tests for the new functionality, and rebased on the main branch.

cc @germag

Added support for bitmaps in I/O virtual address space (required for migration).
src/io_memory.rs (outdated)
```rust
/// Underlying `GuestMemory` type.
type PhysicalMemory: GuestMemory;
/// Dirty bitmap type for tracking writes to the IOVA address space.
type Bitmap: Bitmap;
```
I am not sure I like the design here. The IOVA address space, being virtual, can have aliases. Could you still store the bitmap in the low-level memory region, but with accesses going through IOMMU translation?
What does vhost-user require? Are dirty bitmap accesses done in IOVA or GPA space?
> Are dirty bitmap accesses done in IOVA or GPA space?
They’re done in IOVA space, which is why I think we need the bitmap on this level.
(If we continued to store it in e.g. `GuestRegionMmap`/`MmapRegion`, it would need a reverse translation, and then keep a mapping of address ranges to bitmap slices instead of just a single bitmap slice linearly covering the entire region.)
> They’re done in IOVA space, which is why I think we need the bitmap on this level.
I see. Maybe you could add a `bitmap::BS<'a, Self::Bitmap>` as another argument that `try_access()` passes back to the callback? And then `IommuMemory::get_slice()` can pass that argument to `replace_bitmap()`.

This way, the `Iommu` controls whether dirty tracking is done in IOVA or GPA space.
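For illustration, a minimal self-contained sketch of that idea. Every name here (`DirtySlice`, `Region`, the callback shape) is a stand-in rather than vm-memory's actual API:

```rust
pub struct GuestAddress(pub u64);
pub struct Region; // stand-in for a memory region type
pub struct DirtySlice; // stand-in for bitmap::BS<'a, B>

impl DirtySlice {
    pub fn mark_dirty(&self, _offset: usize, _len: usize) {}
}

// The callback additionally receives the bitmap slice it should dirty, so
// whoever implements try_access() decides which address space that slice
// tracks (IOVA or GPA).
pub fn try_access<F>(count: usize, addr: GuestAddress, mut f: F) -> Result<usize, ()>
where
    F: FnMut(usize, usize, &Region, &DirtySlice) -> Result<usize, ()>,
{
    // A real implementation would translate `addr` and call `f` once per
    // contiguous mapping; this sketch calls it once with dummy values.
    let _ = addr;
    f(0, count, &Region, &DirtySlice)
}

fn main() {
    let written = try_access(16, GuestAddress(0x1000), |offset, len, _region, bitmap| {
        bitmap.mark_dirty(offset, len); // dirty exactly what was written
        Ok(len)
    });
    assert_eq!(written, Ok(16));
}
```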
You mean putting the responsibility on the `try_access()` callback to dirty a bitmap slice given to it if it has written anything?

I’m not sure; that does sound more than reasonable, but it would require more changes in the callers than just adding the `Permissions` flag. 🙂
> You mean putting the responsibility on the try_access() callback to dirty a bitmap slice given to it if it has written anything?

No, I confused `try_access` and `get_slice`, sorry. For `IoMemory`, dirtying is already done in `try_access` itself; for `Iommu`, the slice could be returned by `translate`, that is, it would be part of the `Iotlb` entry?
I’m sorry, I don’t quite follow; currently, dirtying is done only by `IommuMemory` (in its `IoMemory` implementation), neither by `IoMemory` nor by `Iommu`.

`<IommuMemory as IoMemory>::try_access()` dirties the bitmap itself, right.

`<IommuMemory as IoMemory>::get_slice()` has the `VolatileSlice` do the work. For this, it replaces the `VolatileSlice`’s internal bitmap slice (which is initially set by the `GuestMemoryRegion`) by the right slice in the IOVA space.
I don’t think the IOMMU object should do any of this, and I don’t think it should return slices (if I understand correctly). I think it should only do the translation and not care about the actual memory.
> <IommuMemory as IoMemory>::get_slice() has the VolatileSlice do the work. For this, it replaces the VolatileSlice’s internal bitmap slice (which is initially set by the GuestMemoryRegion) by the right slice in the IOVA space.
To clarify, I was saying it should not be the `IommuMemory` that decides the address space used by the bitmap - whether IOVA space as vhost wants, or GPA space as everyone else (?) wants. I thought `Iommu` would return the slice in the `Iotlb` object, but I agree that probably `Iommu` isn't the right place either.

That said, I don't care much, because as long as `IommuMemory` is hidden behind a feature, it's just a convenience or a sample implementation (and QEMU probably will not use `IommuMemory`, only `IoMemory`). Maybe just add a comment that says "NOTE: `IommuMemory` can only be used if the dirty bitmap tracks accesses in IOVA space".
> Maybe just add a comment that says "NOTE: IommuMemory can only be used if the dirty bitmap tracks accesses in IOVA space".
To be precise, it depends on the IOMMU enabled setting: With it enabled, it’s IOVA space; otherwise, it’s whatever the address space of the underlying `GuestMemory` is (for vhost, that would be VUA).
I could speculate on how it works for non-vhost (specifically in-hypervisor) cases; maybe the underlying memory will use GPA in that case, and therefore we can (and have to) track dirty accesses in GPA then. To make that work, we would only need a separate flag to control whether the IOVA dirty bitmap is supposed to be used or not.
But I think the best would be for me not to speculate and just leave this as it is – if such a separate flag would be enough, we can just add it later. For the time being, as you suggest, a comment that specifies the dirty bitmap behavior should be enough.
Overall this is good stuff. It also has a lot of overlap with the experimental changes that were necessary in order to use vm-memory in QEMU, which is a very good thing.
I have two main questions:

- do we really need an `IoMemory`, or can we change `GuestMemory`?
- if we need an `IoMemory`, should the `PhysicalMemory: GuestMemory` type instead be an `AS: GuestAddressSpace`? (and likewise for `physical_memory()`)?
I'm marking this as "request changes" because there would be changes either way.
Please extract the first three commits, up to "Bytes: Do not use to_region_addr()", into a separate PR.

+1 to changing

No, I don't think that is a good idea. IOVA and GPA are the same concept, though in different address spaces. My hope when leaving the review was that no changes are needed outside vm-memory, and the interpretation of GuestAddress can be left to whoever creates the GuestMemory.
The main practical problem with using different types (as stated in the opening comment) is that users generally don’t even know whether a given address is a GPA or an IOVA. For example, the addresses you get from vrings and vring descriptors can be either; it just depends on whether the IOMMU is enabled or not. Users should be able to use the same code whether it is on or off, which (I think) wouldn’t really be possible with different types.
I hope Paolo will add a comment to that effect (we just had a talk); we agreed that adding I don’t like keeping (Paolo (maybe half-jokingly) suggested making the implementation of that a linking-time error by accessing undefined references in a hypothetical
Yes, GuestMemory is a nice interface for implementation but clients should switch to

Different question about this: Do we really need the

Doesn’t sound bad, but would be a more “radical change”: Intermediate users like virtio-queue currently use memory types via If we change patterns like

Mh, potentially this wouldn't be that much churn, because we could have Somewhat related question, since I'm seeing the
@bonzini: Here are some repos (renaming commits marked as “[preliminary]”):
There’s a comment on one of the vm-memory commits about the But reading it again, I’m no longer so sure. I’m bringing this up so someone can make it clear to me :)
So all in all, I understand the switch makes sense to nudge all users to change their code so they can work with virtual memory if that becomes necessary; and the new trait still allows using all existing (physical) types, so it would all remain But on the other hand, as far as I can see, in all cases where we need to be able to access virtual memory it seems to be about I/O device memory, so the name
I'm not sure about that. How to allocate memory is a choice of the VMM, not the boot loader. Suppose a board has RAM at 0..2G and 4G..8G. Whether to allocate it as a single mmap or two should not be the boot loader's choice. If two (which is what something like firecracker would do) you can use A boot loader should also make sure to prepare the guest's memory view in such a way that the guest, after view, sees exactly what was prepared by the boot loader. Take as an example the PC's ability to remap the memory below 1MB to the corresponding area below 4GB. If a boot loader uses Also, CPUs can have their own memory view, which is not the same as either the pure memory backend and not the same as any device. For example on x86 the local APIC is only visible to the CPUs. So we could call your Anyhow, looking at
One further place where I want to use the new trait is for the Xen code. With Xen Grant mappings, vm-memory needs to know whether an access from the hypervisor is a read or a write, and issue an ioctl with that information before the memory can be accessed. Currently, this is hacked into

OK, thanks for the explanations! I’ve rebased on top of #339 (and because the atomic access implementations now require that
@XanClic needs another rebase 😅

Also, given the intention to rename everything, I would not create a separate io_memory module for IoMemory, and would instead create an iommu module for IommuMemory and its friends.
Force-pushed from f4889ed to 4671a49

Force-pushed from bb70b4f to c955ffd
Update with rebase on main:
src/guest_memory.rs (outdated)
```rust
///
/// Returned by [`IoMemory::get_slices()`].
pub trait IoMemorySliceIterator<'a, B: BitmapSlice>:
    Iterator<Item = Result<VolatileSlice<'a, B>>> + Sized
```
```diff
-    Iterator<Item = Result<VolatileSlice<'a, B>>> + Sized
+    Iterator<Item = Result<VolatileSlice<'a, B>>> + Sized + FusedIterator
```
Why? Yes, the current implementations do provide that. But is it something we need the trait to require?
I think so, because you want any other implementations to be consistent. It's just a marker.
If another implementation can’t provide `FusedIterator` for free, this would incur overhead, though.

As far as I understand `FusedIterator`, users who need it can just use `.get_slices()?.fuse()`, and this should be zero-cost if the concrete type does implement `FusedIterator`, no?
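For illustration, that pattern with a plain standard-library iterator standing in for the one `get_slices()` would return:

```rust
fn main() {
    // `slice::Iter` already implements `FusedIterator`, so the `Fuse`
    // wrapper created by `.fuse()` can rely on that guarantee instead of
    // tracking exhaustion itself.
    let slices = [1u8, 2, 3];
    let mut it = slices.iter().fuse();
    assert_eq!(it.next(), Some(&1));
    assert_eq!(it.next(), Some(&2));
    assert_eq!(it.next(), Some(&3));
    // After exhaustion, a fused iterator keeps returning `None`.
    assert_eq!(it.next(), None);
    assert_eq!(it.next(), None);
}
```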
Yes, but I think generally speaking fused behavior is desirable. Chances are that users will expect fused behavior, and their code would then work only on some backends. (Also I don't see how an `IoMemorySliceIterator` could magically restart providing values after a `None`... almost all iterators should be fused IMO).
I don’t know what to say, that’s just wrong. (Edit: “that” being the assumption that all iterators are fused)
As I see it, Rust as a language has chosen iterators to generally not be fused, and requiring it in this trait would go against that design decision.
You're right, that was not really a great way to put it, so let me rephrase.

`IoMemorySliceIterator` is returned by `get_slices()`. Users of `get_slices()` should know what semantics to expect if the iterator starts providing items again after `None`. The documentation should say that, while making it clear that returning an iterator with fused behavior is acceptable too. If there is a reasonable description of what to expect, then the current code is fine but the documentation for `get_slices()` is lacking. I don't think this is the case, the documentation is fine.

But then, returning a `FusedIterator` should be part of `get_slices()`'s contract, and `+ FusedIterator` should be in either `get_slices()` or `IoMemorySliceIterator`. The former seems unnecessarily pedantic, so I'd put it in the trait.
That is, for unfused iterators to be useful, you need to know when to try calling `.next()` again and what the result would be. Generally they are used for "resuming" iterators, like `std::sync::mpsc::TryIter`. I don't think `get_slices()` should return such a resuming iterator.
OK, I’ll add the `FusedIterator` bound, but I disagree with this specific reasoning. In short: I still don’t think it’s actually useful, but I think it’s necessary.

> That is, for unfused iterators to be useful, you need to know when to try calling .next() again and what the result would be.

I really don’t see why that would be the case. You run `.next()` until it returns `None`, then you don’t use the iterator any more. In my experience, that is the absolute vast majority of use cases, and it doesn’t matter how the iterator behaves after the first `None`.

Consequently, I think it can be fine for the documentation to leave open what `next()` does once iteration is finished, if that is deliberately left implementation-defined. However, of course, clearer documentation covering more edge cases is always better.

As for users requiring fused behavior, `FusedIterator`’s documentation (as I understand it) says not to rely on bounds, but to always use `.fuse()`.

I also disagree with the implication that if the iterator isn’t fused, it must be resuming. As you say, “generally” non-fused iterators could be resuming, not always. Other behavior is perfectly legal if it’s deemed more useful, and I would say largely undefined behavior is also OK if absolutely required for optimization purposes.

Having said all that, what you do make me see is that I should deliberate what behavior I actually do want. Implementation-defined (effectively undefined) behavior comes at a cost, I can’t just choose this by default because I’m lazy: Safe code is allowed to call `.next()` after the iterator is exhausted, and while that is wrong if you don’t know what to expect, it’s still legal safe code, so it is even more wrong for me to shrug my shoulders and say “whatever, don’t do that”.
Now, if such undefined behavior actually had a good reason to it, e.g. “It’s really important for performance here”, OK, but I don’t have such a reason. The only actual reason I have is that once we add that bound, removing it would be an incompatible change. Not very strong.
So yes, I really should at least consider what behavior makes sense after iterator exhaustion, and specify it. And here I agree that fused behavior is indeed the only behavior that does make sense, so let’s add that bound.
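A small illustration of the contract being agreed on here, using a standard-library iterator (which happens to be fused):

```rust
fn main() {
    let mut it = vec![1, 2].into_iter();
    assert_eq!(it.next(), Some(1));
    assert_eq!(it.next(), Some(2));
    assert_eq!(it.next(), None);
    // Safe code may keep calling next() after exhaustion; the
    // `FusedIterator` bound pins the result down to `None` forever
    // instead of leaving it implementation-defined.
    assert_eq!(it.next(), None);
}
```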
Sounds good, in fact I agree with all that you wrote. :) Including this:
> I still don’t think it’s actually useful, but I think it’s necessary.
Thanks for writing it down better than I did.
```rust
///
/// Disabling the IOMMU switches to pass-through mode, where every access is done directly on
/// the underlying physical memory.
pub fn set_iommu_enabled(&mut self, enabled: bool) {
```
Add a getter too?
Sure!
src/iommu.rs (outdated)
```rust
/// Note that the inner `Arc` reference to the IOMMU is cloned, i.e. both the existing and the
/// new `IommuMemory` object will share an IOMMU instance. (The `use_iommu` flag however is
/// copied, so is independent between the two instances.)
pub fn inner_replaced(&self, inner: M) -> Self {
```
So if I understand correctly you'd do something like

```rust
// `mem` is a GuestMemoryAtomic
let g = mem.lock().unwrap();
let new_iommu = mem.load().inner_replaced(new_memory);
g.replace(new_iommu);
```

Maybe we can add an

```rust
impl<M: GuestMemory, I: IOMMU> GuestMemoryExclusiveGuard<'_, IommuMemory<M, I>> {
    /// Replace the memory map in the `GuestMemoryAtomic` that created
    /// the guard with the new memory map, `map`, while keeping the IOMMU.
    /// This method consumes the guard.
    pub fn replace_backend(self, map: M) {
        let old = self.parent.inner.0.load();
        let new = old.inner_replaced(map);
        self.replace(new);
    }
}
```

(and TBH I'd love for `inner_replaced` to be `pub(crate)` but maybe that's asking too much).
Yes, that’s the pattern.
(In practice, `inner_replaced()` is called through a helper trait (`VirtualMemory`) that vhost-user-backend implements on both `IommuMemory` and `GuestMemoryMmap`, which provides a `with_physical_memory()` method that uses `inner_replaced()` for `IommuMemory` and just returns the new memory for `GuestMemoryMmap`.)

Why do you want to add that method?

Also, it isn’t generic across `IoMemory`, so it could only be used specifically for `IommuMemory`. I suppose we could add a trait `Foo<M: IoMemory>` with this method, and implement it for multiple `GuestMemoryExclusiveGuards<M>`. Then, vhost-user-backend could drop that `with_physical_memory()` method (but still needs the trait anyway). That alone doesn’t seem worth the effort here to me, but that’s why I’m wondering why you would like to have that method.
> Why do you want to add that method?

Because right now `GuestMemoryAtomic` focuses on replacing the whole `GuestMemory`, but there is a use case for replacing the backend only.

> Also, it isn’t generic across IoMemory, so it could only be used specifically for IommuMemory

Yes, it's an extra `impl` only for `IommuMemory`.

> That alone doesn’t seem worth the effort here to me, but that’s why I’m wondering why you would like to have that method

More for clarity/documentation than anything else. `inner_replaced` was a bit mysterious (maybe it's just the name; `with_replaced_backend` or anything along those lines would help me too).

I see now that it can't be `pub(crate)` though!
> there is a use case for replacing the backend only.

True, but if we wanted to really provide for that use case, we’d need a generic `IoMemory` implementation (at least for all types in vm-memory).

> `inner_replaced` was a bit mysterious (maybe it's just the name; `with_replaced_backend` or anything along those lines would help me too).

Sure, I’d be more than happy to (and also rename `.inner()` to `.get_backend()`). I’ll definitely do this.
Ok, let's do the rename and agree to disagree on the rest. :)
src/mmap/mod.rs (outdated)
```rust
/// The returned reference can be cloned to construct a new `GuestRegionMmap` with a different
/// base address (e.g. when switching between memory address spaces based on the guest physical
/// address vs. the VMM userspace virtual address).
pub fn get_mapping(&self) -> &Arc<MmapRegion<B>> {
```
I'd rather call this `get_mmap()` and do the clone itself, returning an `Arc<MmapRegion<B>>`.
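A minimal self-contained sketch of that suggestion, with stand-in types rather than vm-memory's real ones:

```rust
use std::sync::Arc;

pub struct MmapRegion; // stand-in for vm_memory::MmapRegion<B>

pub struct GuestRegionMmap {
    mapping: Arc<MmapRegion>,
}

impl GuestRegionMmap {
    /// Clone the inner `Arc` and return it by value, instead of handing
    /// out a `&Arc<MmapRegion>` for the caller to clone.
    pub fn get_mmap(&self) -> Arc<MmapRegion> {
        Arc::clone(&self.mapping)
    }
}

fn main() {
    let region = GuestRegionMmap { mapping: Arc::new(MmapRegion) };
    let mmap = region.get_mmap();
    // Two owners now: the region itself and `mmap`.
    assert_eq!(Arc::strong_count(&mmap), 2);
}
```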
Sure!
Just a few requests, but mostly I think it's ready.
src/iommu.rs (outdated)
```rust
/// space.
#[derive(Debug, Default)]
///
/// Note on memory write tracking (“logging”):
```
I'll add a note here for simplicity, but it applies to the whole commit.
There are cases (e.g. QEMU :)) in which you want to track at the backend level independent of whether you have an IOMMU or not. The vhost case is the weird one due to its usage of VMM userspace addresses. I'd prefer that the comments and commit messages put less emphasis on this, explaining that there are simply two levels of bitmaps.
Related to this, I'm not super enthusiastic about forcing the usage of Arc<> for the bitmap, because having an Arc<()> in the simple case is a bit weird. However, all the derefs should be optimized away so I guess it's fine.
> I'd prefer that the comments and commit messages put less emphasis on this, explaining that there are simply two levels of bitmaps.
I understood this type to be specific to vhost’s model, hence this note saying so. Now, it will probably be easy to modify it slightly to allow for QEMU’s model as well; basically, `IommuMemorySliceIterator::do_next()` would just have to skip replacing the `VolatileSlice`’s bitmap. But I don’t see the point in doing that now, as I don’t have the knowledge whether this is really what’s needed (and I can’t test it), and such a change could be a compatible one anyway[1].

So I don’t know what you would like to change specifically. Of course I can make it clear in the commit message that this is just for the vhost case, but I’m a bit lost beyond that.

[1] Add a method `disable_virtual_bitmap()`, pass this state on to all `IommuMemorySliceIterator`s, having them skip replacing the bitmap, and this note would be modified to say “If you don’t want that, call `disable_virtual_bitmap()`”. Yes, if “not wanting that” is the default case, that’s a bit weird, but not the end of the world. But doing anything else (e.g. implementing that functionality now, adding a boolean to `IommuMemory::new()`) would require someone to vouch that that’s really what’s needed and that it would really be useful.
> I understood this type to be specific to vhost’s model, hence this note saying so.
It is not used by QEMU but it may be used by other users than vhost. This handling of dirty bitmap is not even required for IommuMemory; I just want to make it clear that it's vhost that needs to have the bitmaps organized this way. IommuMemory allows that but it's possible to use IommuMemory while having bitmaps only at the physical address level.
> But doing anything else (e.g. implement that functionality now, adding a boolean to IommuMemory::new()) would require someone to vouch that that’s really what’s needed and that it would really be useful.

Absolutely. All that I'm asking is to not "discount" the case where the IommuMemory's bitmap is `()` in the documentation.

> Of course I can make it clear in the commit message that this is just for the vhost case
Is it clearer now?
> I just want to make it clear that it's vhost that needs to have the bitmaps organized this way.
I.e. modify the comment to not just say “this type should only be used when this is the desired behavior” but explicitly something like “This is specifically the vhost-user model”?
> All that I'm asking is to not "discount" the case where the IommuMemory's bitmap is () in the documentation.

Hard, because the bitmap between the back-end and `IommuMemory` must be the same type, or we couldn’t replace it in `VolatileSlice`. Would you like the bitmap to be wrapped in `Option<>`?
> Hard, because the bitmap between the back-end and IommuMemory must be the same type, or we couldn’t replace it in VolatileSlice.
Ouch, I see. Yeah, change the comment so that others don't fall into the same mistake as me and make it clearer that IommuMemory wants the two-bitmap scenario.
Do you think it would be useful to have the bitmap be optional?
If you have any ideas we can have a separate PR before the 0.18.0 release. Overall for IommuMemory it's okay, as the whole struct is just a convenience.
OK, makes sense.
The existing `GuestMemory` trait is insufficient for representing virtual memory, as it does not allow specifying the required access permissions. Its focus on all guest memory implementations consisting of a relatively small number of regions is also unsuited for paged virtual memory with a potentially very large set of non-contiguous mappings.

The new `IoMemory` trait in contrast provides only a small number of methods that keep the implementing type’s internal structure more opaque, and every access needs to be accompanied by the required permissions.

Signed-off-by: Hanna Czenczek <[email protected]>
Rust only allows us to give one trait the blanket implementation for `Bytes`. We want `IoMemory` to be our primary external interface because it has users specify the access permissions they need, and because we can (and do) provide a blanket `IoMemory` implementation for all `GuestMemory` types.

Also, while `IoMemory` (as the more general trait) only has a restricted interface when compared to `GuestMemory`, this interface is enough to implement `Bytes`; notably, accesses to `IoMemory` require specifying the access mode, which is naturally trivial for `Bytes` methods like `read()` or `write()`.

Signed-off-by: Hanna Czenczek <[email protected]>
We want a trait like `GuestAddressSpace` for `IoMemory`, but just duplicating it into an `IoAddressSpace` trait is not so easy: We could rename the current `GuestAddressSpace` to `IoAddressSpace` and require `M: IoMemory` (instead of `M: GuestMemory`), and then define `GuestAddressSpace` as:

```rust
pub trait GuestAddressSpace: IoAddressSpace<M: GuestMemory> {}

impl<AS: IoAddressSpace> GuestAddressSpace for AS where AS::M: GuestMemory, {}
```

But doing just this would break all existing `GuestAddressSpace` users, as they’d now need to import `IoAddressSpace` to use `memory()`. (Re-)Adding `GuestAddressSpace::memory()` as

```rust
fn memory(&self) -> <Self as IoAddressSpace>::T {
    IoAddressSpace::memory(self)
}
```

also doesn’t (just) work, as it gets the compiler confused about which `memory()` to use (between `GuestAddressSpace` and `IoAddressSpace`), so the `IoAddressSpace::memory()` method would need to be called differently. However, I would find that a bit silly, and it would also then later require changes if the user wants to switch from `GuestMemory` to `IoMemory`.

Instead, just changing the `GuestAddressSpace::M: GuestMemory` requirement to `M: IoMemory` seems easier:

- All callers that just use the `Bytes` interface remain completely unchanged.
- It does break users that actually need the `GuestMemory` interface, but from what I have seen, that is only the case for `vhost::vhost_kern`. There, we can simply require that `<AS as GuestAddressSpace>::M: GuestMemory`.

Signed-off-by: Hanna Czenczek <[email protected]>
This simply makes `GuestMemoryAtomic` more general. (However, this change requires the preceding commit that relaxed the `GuestAddressSpace::M` requirement from `GuestMemory` to `IoMemory`.)

Signed-off-by: Hanna Czenczek <[email protected]>
The Iommu trait defines an interface for translating virtual addresses into addresses in an underlying address space. It is supposed to do so by internally keeping an instance of the Iotlb type, updating it with mappings whenever necessary (e.g. when actively invalidated or when there’s an access failure) from some internal data source (e.g. for a vhost-user IOMMU, the data comes from the vhost-user front-end by requesting an update).

In a later commit, we are going to provide an implementation of `IoMemory` that can use an `Iommu` to provide an I/O virtual address space.

Note that while I/O virtual memory in practice will be organized in pages, the vhost-user specification makes no mention of a specific page size or how to obtain it. Therefore, we cannot really assume any page size and have to use plain ranges with byte granularity as mappings instead.

Signed-off-by: Hanna Czenczek <[email protected]>
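A rough, self-contained sketch of the relationship this commit message describes; all definitions here are assumptions for illustration, not the PR's actual `Iommu`/`Iotlb` items:

```rust
#[derive(Clone, Copy, Debug)]
pub struct GuestAddress(pub u64);

#[derive(Clone, Copy, Debug)]
pub enum Permissions {
    Read,
    Write,
    ReadWrite,
}

/// One translated range. Byte-granular, because vhost-user mandates no
/// specific page size for IOTLB entries.
#[derive(Debug)]
pub struct Mapping {
    pub iova_base: GuestAddress,
    pub target_base: GuestAddress,
    pub len: u64,
    pub perm: Permissions,
}

pub trait Iommu {
    /// Translate `iova..iova + len` for an access with permissions `perm`,
    /// refilling the internal IOTLB from the data source (e.g. the
    /// vhost-user front-end) on a miss or after an invalidation.
    fn translate(
        &self,
        iova: GuestAddress,
        len: u64,
        perm: Permissions,
    ) -> Result<Vec<Mapping>, ()>;
}
```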
This `IoMemory` type provides an I/O virtual address space by adding an IOMMU translation layer to an underlying `GuestMemory` object.

Signed-off-by: Hanna Czenczek <[email protected]>
The vhost-user-backend crate will need to be able to modify all existing memory regions to use the VMM user address instead of the guest physical address once the IOMMU feature is switched on, and vice versa. To do so, it needs to be able to modify regions’ base address.

Because `GuestMemoryMmap` stores regions wrapped in an `Arc<_>`, we cannot mutate them after they have been put into the `GuestMemoryMmap` object; and `MmapRegion` itself is by its nature not clonable. So to modify the regions’ base addresses, we need some way to create a new `GuestRegionMmap` referencing the same `MmapRegion` as another one, but with a different base address.

We can do that by having `GuestRegionMmap` wrap its `MmapRegion` in an `Arc`, adding a method to return that `Arc`, and a method to construct a `GuestRegionMmap` object from such a cloned `Arc`.

Signed-off-by: Hanna Czenczek <[email protected]>
Without an IOMMU, we have direct access to guest physical addresses (GPAs). In order to track our writes to guest memory (during migration), we log them into dirty bitmaps, and a page's bit index is its GPA divided by the page size.

When it comes to vhost-user, however, and we use an IOMMU, we no longer know the GPA; instead, we operate on I/O virtual addresses (IOVAs) and VMM user-space addresses (VUAs). Here, the dirty bitmap bit index is the IOVA divided by the page size.

`IoMemory` types contain an internal "physical" memory type that (in the case of vhost-user) operates on these VUAs (`IoMemory::PhysicalMemory`). Any bitmap functionality that this internal type may already have (e.g. `GuestMemoryMmap` does) cannot be used for vhost-user dirty bitmap tracking with an IOMMU, because it would use the VUA, but we need to use the IOVA, and this information is not available on that lower layer.

Therefore, `IoMemory` itself needs to support bitmaps separately from its inner `PhysicalMemory`, which will be used when the IOMMU is in use. Add an associated `IoMemory::Bitmap` type and add a bitmap object to `IommuMemory`. Ensure that writes to memory dirty that bitmap appropriately:

- In `try_access()`, if write access was requested, dirty the handled region of the bitmap after the access is done.
- In `get_slice()`, replace the `VolatileSlice`'s bitmap (which comes from the inner `PhysicalMemory`) by the correct slice of our IOVA bitmap before returning it.

Signed-off-by: Hanna Czenczek <[email protected]>
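The bit-index rule this describes is simply the following (page size taken as a parameter, since vhost-user mandates none):

```rust
/// Dirty-bitmap bit index of an access: the address (GPA without an IOMMU,
/// IOVA with one) divided by the page size.
pub fn bit_index(addr: u64, page_size: u64) -> u64 {
    addr / page_size
}

fn main() {
    assert_eq!(bit_index(0x3000, 0x1000), 3);
}
```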
This commit also adds the iommu feature to the coverage_config feature list. (I left the aarch64 coverage value unchanged; I cannot find out how to get the current value on my system, and it isn’t included in CI.)

Signed-off-by: Hanna Czenczek <[email protected]>
Document in DESIGN.md how I/O virtual memory is handled.

Signed-off-by: Hanna Czenczek <[email protected]>
Signed-off-by: Hanna Czenczek <[email protected]>
Addressed Paolo’s comments:
Summary of the PR
This MR adds support for an IOMMU, and thus I/O virtual memory handling.
New Memory Trait: IoMemory

Handling I/O virtual memory requires a new interface to access guest memory: `GuestMemory` does not allow specifying the required access permissions, which is necessary when working with MMU-guarded memory.

We could add memory access methods with such a permissions parameter to `GuestMemory`, but I prefer to provide a completely new trait instead. This ensures that users will only use the interface that actually works when working with (potentially) I/O virtual memory, i.e.:

- `GuestMemory` generally assumes that regions are long, continuous, and that any address in a given range will be in the same memory region. This is absolutely no longer the case with virtual memory, which is heavily fragmented into pages.

That is, adding a new trait (`IoMemory`) allows catching a lot of potential mistakes at compile time, which I feel is much better than finding out at runtime that some place forgot to specify the access permissions.
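To illustrate the idea, a minimal self-contained sketch of such a permissions-carrying interface. The names echo the discussion (`Permissions`, `IoMemory`), but these definitions are assumptions, not the PR's actual trait:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Permissions {
    Read,
    Write,
    ReadWrite,
}

#[derive(Clone, Copy, Debug)]
pub struct GuestAddress(pub u64);

/// No access can even be expressed without stating the permissions it
/// needs, so forgetting them becomes a compile-time error.
pub trait IoMemory {
    /// Whether `addr..addr + count` is accessible with permissions `perm`.
    fn check_range(&self, addr: GuestAddress, count: usize, perm: Permissions) -> bool;
}
```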
Unfortunately, this is an incompatible change, because we need to decide on a single guest memory trait that we expect users to primarily use: We can only have one blanket implementation of e.g. `Bytes`, and this MR changes that blanket implementation to be on `IoMemory` instead of `GuestMemory`, because we want to prefer `IoMemory` with its permissions-including interface.

While this MR does provide a blanket implementation of `IoMemory` for all `GuestMemory`, Rust isn’t fully transitive here, so just because we have a blanket `impl IoMemory for GuestMemory` and a blanket `impl Bytes for IoMemory` doesn’t really implicitly give us an `impl Bytes for GuestMemory`.

What this means can be seen in virtio-queue (in vm-virtio): It uses trait bounds like `M: GuestMemory` only, but then expects to be able to use the `Bytes` trait. This is no longer possible; the trait bound must be extended to `M: GuestMemory + Bytes` or replaced by `M: IoMemory` (the latter is what we want).
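A compile-time sketch of that bound change, with stand-in traits (in the real crates these come from vm-memory; making `Bytes` a supertrait of `IoMemory` here stands in for the blanket impl):

```rust
pub trait GuestMemory {}
pub trait Bytes {}
pub trait IoMemory: Bytes {}

// Previously, `M: GuestMemory` alone was enough to use `Bytes` methods,
// thanks to the blanket impl. Now the bound must be extended ...
pub fn process_queue_explicit<M: GuestMemory + Bytes>(_mem: &M) {}

// ... or replaced by the new trait (the preferred option):
pub fn process_queue_io<M: IoMemory>(_mem: &M) {}
```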
Guest Address Type
Another consideration is that I originally planned to introduce new address types. `GuestAddress` currently generally refers to a guest physical address (GPA); but we now also need to deal with I/O virtual addresses (IOVAs), and an IOMMU generally doesn’t translate those into GPAs, but VMM user space addresses (VUAs) instead, so now there’s three kinds of addresses. Ideally, all of those should get their own type; but I felt like:

- Users generally don’t even know whether an address for an `IoMemory` object is an IOVA or a GPA. It depends on whether the IOMMU is enabled or not, which is generally a runtime question.

Therefore, I kept `GuestAddress` as the only type, and it may refer to any of the three kinds of addresses (GPAs, IOVAs, VUAs).
Async Accesses
I was considering whether to also make memory accesses optionally `async`. The vhost-user IOMMU implementation basically needs two vhost-user socket roundtrips per IOTLB miss, which can make guest memory accesses quite slow. An `async` implementation could allow mitigating that.

However, I decided against it (for now), because this would also require extensive changes in all of our consuming crates to really be useful: Anything that does a guest memory access should then be `async`. I think if we want to add this functionality later, it should be possible in a compatible manner.
Changes Necessary in Other Crates
vm-virtio
Implementation: https://gitlab.com/hreitz/vm-virtio/-/commits/iommu
As stated above, places that bind `M: GuestMemory` but expect the `Bytes` trait to also be implemented need to be changed to `M: GuestMemory + Bytes` or `M: IoMemory`. I opted for the latter approach, and basically replaced all `GuestMemory` instances by `IoMemory`.

(That is what we want, because dropping `GuestMemory` in favor of `IoMemory` ensures that all vm-virtio crates can work with virtual memory.)
vhost
Implementation: https://gitlab.com/hreitz/vhost/-/commits/iommu
Here, the changes that updating vm-memory necessitates are quite marginal, and have a similar cause: but instead of requiring the `Bytes` trait, it’s the `GuestAddressSpace` trait. The resolution is the same: Switch from requiring `GuestMemory` to `IoMemory`.

The rest of the commits concerns itself with implementing `VhostUserIommu` and allowing users to choose to use `IommuMemory<GuestMemoryMmap, VhostUserIommu>` instead of only `GuestMemoryMmap`.
.virtiofsd (as one user)
Implementation: https://gitlab.com/hreitz/virtiofsd-rs/-/commits/iommu
This is an example of an actual user. Updating all crates to IOMMU-supporting versions actually does not require any changes to the code, but enabling the 'iommu' feature does: This feature makes the vhost-user-backend crate require the `VhostUserBackend::Memory` associated type (because associated type defaults are not stable yet), so this single line of code must be added (which sets the type to `GuestMemoryMmap<BitmapMmapRegion>`), as sketched below.
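A self-contained sketch of that single line; `MyBackend` and all the stand-in definitions are hypothetical, and the real `VhostUserBackend` trait comes from the vhost-user-backend crate:

```rust
use std::marker::PhantomData;

// Stand-ins for the real vm-memory types.
pub struct BitmapMmapRegion;
pub struct GuestMemoryMmap<B>(PhantomData<B>);

// Stand-in for the vhost-user-backend trait; only the associated type
// matters for this illustration.
pub trait VhostUserBackend {
    type Memory;
}

pub struct MyBackend;

impl VhostUserBackend for MyBackend {
    // The single line that enabling the 'iommu' feature makes mandatory:
    type Memory = GuestMemoryMmap<BitmapMmapRegion>;
}
```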
Actually enabling IOMMU support is then a bit more involved, as it requires switching away from `GuestMemoryMmap` to `IommuMemory` again.

However, to me, this shows that end users working with concrete types do not seem to be affected by the incompatible `IoMemory` change until they want to opt in to it. That’s because `GuestMemoryMmap` implements both `GuestMemory` and `IoMemory` (thanks to the blanket impl), so it can transparently be used wherever the updated crates expect to see an `IoMemory` type.

Requirements
Before submitting your PR, please make sure you addressed the following requirements:

- All commits in this PR are signed (with `git commit -s`), and the commit message has max 60 characters for the summary and max 75 characters for each description line.
- All added/changed functionality has a corresponding unit/integration test.
- All added/changed public-facing functionality has entries in the "Upcoming Release" section of CHANGELOG.md (if no such section exists, please create one).
- Any newly added `unsafe` code is properly documented.